NSF PAR Search | NSF Public Access Repository


Robust Distance Correlation for Variable Screening

https://doi.org/10.1002/sta4.70094

Ma, Tianzhou; Yang, Fan; Ke, Hongjie; Ren, Zhao (September 2025, Stat)

ABSTRACT In modern statistical applications, identifying critical features in high‐dimensional data is essential for scientific discoveries. Traditional best subset selection methods face computational challenges, while regularization approaches such as Lasso, SCAD and their variants often exhibit poor performance with ultrahigh‐dimensional data. Sure screening methods, widely used for dimensionality reduction, have been developed as popular alternatives, but few target heavy‐tailed characteristics in modern big data. This paper introduces a new sure screening method, based on robust distance correlation (‘RDC’), designed for heavy‐tailed data. The proposed method inherits the benefits of the original model‐free distance correlation‐based screening while robustly estimating distance correlation in the presence of heavy‐tailed data. We further develop an FDR control procedure by incorporating the Reflection via Data Splitting (REDS) method. Extensive simulations demonstrate the method's advantage over existing screening procedures under different scenarios of heavy‐tailedness. Its application to high‐dimensional heavy‐tailed RNA‐seq data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort showcases superior performance in identifying biologically meaningful genes predictive of MAPK1 protein expression critical to pancreatic cancer.
Free, publicly-accessible full text available September 1, 2026
A multivariate to multivariate approach for voxel‐wise genome‐wide association analysis

https://doi.org/10.1002/sim.10101

Wu, Qiong; Zhang, Yuan; Huang, Xiaoqi; Ma, Tianzhou; Hong, L Elliot; Kochunov, Peter; Chen, Shuo (August 2024, Statistics in Medicine)

The joint analysis of imaging‐genetics data facilitates the systematic investigation of genetic effects on brain structures and functions with spatial specificity. We focus on voxel‐wise genome‐wide association analysis, which may involve trillions of single nucleotide polymorphism (SNP)‐voxel pairs. We attempt to identify underlying organized association patterns of SNP‐voxel pairs and understand the polygenic and pleiotropic networks on brain imaging traits. We propose a bi‐clique graph structure (i.e., a set of SNPs highly correlated with a cluster of voxels) for the systematic association pattern. Next, we develop computational strategies to detect latent SNP‐voxel bi‐cliques and an inference model for statistical testing. We further provide theoretical results to guarantee the accuracy of our computational algorithms and statistical inference. We validate our method by extensive simulation studies, and then apply it to the whole genome genetic and voxel‐level white matter integrity data collected from 1052 participants of the Human Connectome Project. The results demonstrate multiple genetic loci influencing white matter integrity measures on splenium and genu of the corpus callosum.
Full Text Available
A prototype early warning system for diarrhoeal disease to combat health threats of climate change in the Asia-Pacific region

https://doi.org/10.1088/1748-9326/ad8366

Cruz Cano, Raul; He, Hao; Aryal, Samyam; Dhimal, Megnath; Thu, Dang Thi Anh; Zhang, Linus; Ma, Tianzhou; Liang, Xin-Zhong; Murtugudde, Raghu; Gao, Chuansi; et al. (October 2024, Environmental Research Letters)

Abstract Ongoing climate variability and change are increasing the burden of diarrhoeal disease worldwide. Meaningful early warning systems with adequate lead times (weeks to months) are needed to guide public health decision-making and enhance community resilience against health threats posed by climate change. Toward this goal, we trained various machine-learning models to predict diarrhoeal disease rates in Nepal (2002–2014), Taiwan (2008–2019), and Vietnam (2000–2015) using temperature, precipitation, previous disease rates, and El Niño Southern Oscillation phases. We also compared the performance of shallow time-series neural network (NN), random forest regressor, artificial NN, gradient boosting regressor, and long short-term memory-based methods for their effectiveness in predicting diarrhoeal disease burden across multiple countries. We evaluated model performance using a test dataset and assessed the accuracy of predicted diarrhoeal disease incidence rates for the last year of available data in each district. Our results suggest that even in the absence of the most recent disease surveillance data, a likely scenario in most low- and middle-income countries, our NN-based early warning system using historical data performs reasonably well. However, future studies are needed to perform prospective evaluations of such early warning systems in real-world settings.
High-dimension to high-dimension screening for detecting genome-wide epigenetic and noncoding RNA regulators of gene expression

https://doi.org/10.1093/bioinformatics/btac518

Ke, Hongjie; Ren, Zhao; Qi, Jianfei; Chen, Shuo; Tseng, George C.; Ye, Zhenyao; Ma, Tianzhou; Alkan, Can (ed.) (July 2022, Bioinformatics)

Abstract Motivation: The advancement of high-throughput technology characterizes a wide variety of epigenetic modifications and noncoding RNAs across the genome involved in disease pathogenesis via regulating gene expression. The high dimensionality of both epigenetic/noncoding RNA and gene expression data makes it challenging to identify the important regulators of genes. Conducting a univariate test for each possible regulator–gene pair is subject to a serious multiple comparison burden, and direct application of regularization methods to select regulator–gene pairs is computationally infeasible. Applying fast screening to reduce dimension first before regularization is more efficient and stable than applying regularization methods alone. Results: We propose a novel screening method based on robust partial correlation to detect epigenetic and noncoding RNA regulators of gene expression over the whole genome, a problem that includes both high-dimensional predictors and high-dimensional responses. Compared to existing screening methods, our method is conceptually innovative in that it reduces the dimension of both predictor and response, and screens at both node (regulators or genes) and edge (regulator–gene pairs) levels. We develop data-driven procedures to determine the conditional sets and the optimal screening threshold, and implement a fast iterative algorithm. Simulations and applications to long noncoding RNA and microRNA regulation in kidney cancer and DNA methylation regulation in glioblastoma multiforme illustrate the validity and advantage of our method. Availability and implementation: The R package, related source codes and real datasets used in this article are provided at https://github.com/kehongjie/rPCor. Supplementary information: Supplementary data are available at Bioinformatics online.
Detecting survival-associated biomarkers from heterogeneous populations

https://doi.org/10.1038/s41598-021-82332-y

Saegusa, Takumi; Zhao, Zhiwei; Ke, Hongjie; Ye, Zhenyao; Xu, Zhongying; Chen, Shuo; Ma, Tianzhou (December 2021, Scientific Reports)
Abstract Detection of prognostic factors associated with patients’ survival outcome helps gain insights into a disease and guide treatment decisions. The rapid advancement of high-throughput technologies has yielded plentiful genomic biomarkers as candidate prognostic factors, but most are of limited use in clinical application. As the price of the technology drops over time, many genomic studies are conducted to explore a common scientific question in different cohorts to identify more reproducible and credible biomarkers. However, new challenges arise from heterogeneity in study populations and designs when jointly analyzing multiple studies. For example, patients from different cohorts show different demographic characteristics and risk profiles. Existing high-dimensional variable selection methods for survival analysis, however, are restricted to single-study analysis. We propose a novel Cox-model-based two-stage variable selection method called “Cox-TOTEM” to detect survival-associated biomarkers common in multiple genomic studies. Simulations showed our method greatly improved the sensitivity of variable selection as compared to the separate applications of existing methods to each study, especially when the signals are weak or when the studies are heterogeneous. An application of our method to TCGA transcriptomic data identified essential survival-associated genes related to the common disease mechanism of five Pan-Gynecologic cancers.
Full Text Available
Variable screening with multiple studies

https://doi.org/10.5705/ss.202017.0439

Ma, Tianzhou; Ren, Zhao; Tseng, George C. (January 2020, Statistica Sinica)

Full Text Available
